Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM (the BigScience Large Open-science Open-access Multilingual language model), our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
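As a rough illustration of how the "model size and shape" decision maps to a parameter budget, the standard decoder-only estimate of about 12·L·d² non-embedding parameters plus V·d embedding parameters can be sketched as below. This is a back-of-the-envelope approximation, not a formula from this abstract; the shape values used are the publicly documented BLOOM configuration (70 layers, hidden size 14336, ~250k-token vocabulary).

```python
def transformer_params(n_layers, d_model, vocab_size):
    # Standard decoder-only estimate: each block holds ~4*d^2 attention
    # weights plus ~8*d^2 MLP weights (with a 4x hidden expansion),
    # i.e. ~12*d^2 per layer; biases and layernorms are lower-order
    # terms and are ignored here.
    non_embedding = 12 * n_layers * d_model ** 2
    embedding = vocab_size * d_model  # token embedding matrix
    return non_embedding + embedding

# BLOOM-like shape: 70 layers, hidden size 14336, vocabulary 250,880
total = transformer_params(70, 14336, 250_880)
print(f"{total / 1e9:.1f}B parameters")  # close to the reported 176B
```

The estimate landing near 176B shows why this approximation is a common first step when trading layer count against width under a fixed compute budget.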
The infrastructure necessary to train state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invite the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explain how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discuss how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.
Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated high-performance computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions have become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of a successful collaborative language model pretraining run with 40 participants.
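The core averaging step in such a framework can be sketched in miniature: each peer submits a local gradient together with the number of examples it processed, and contributions are weighted accordingly, so that fast and slow devices count in proportion to the work they actually did. This is a simplified software stand-in under that assumption, not the paper's protocol; the function name and data layout are illustrative.

```python
def average_gradients(peer_grads):
    # peer_grads: list of (gradient_vector, num_examples) pairs, one
    # per peer. Weight each gradient by the fraction of the global
    # batch its peer processed, then sum the weighted vectors.
    total = sum(n for _, n in peer_grads)
    dim = len(peer_grads[0][0])
    avg = [0.0] * dim
    for grad, n in peer_grads:
        weight = n / total
        for i in range(dim):
            avg[i] += weight * grad[i]
    return avg

# Two peers, each having processed 2 examples of a 4-example batch
result = average_gradients([([1.0, 2.0], 2), ([4.0, 0.0], 2)])
print(result)  # equal weights here, so a plain element-wise mean
```

In a real decentralized run this averaging would happen over unreliable connections with peers joining and leaving, which is exactly the part the paper's framework is designed to handle.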
Consensus algorithms form the foundation of many distributed algorithms by enabling multiple robots to converge to consistent estimates of global variables using only local communication. However, standard consensus protocols can be easily led astray by non-cooperative team members. For this reason, the study of resilient forms of consensus is necessary for designing resilient distributed algorithms. W-MSR consensus is one such resilient consensus algorithm, which requires only local knowledge of the communication graph and no prior model of the shared data. However, verifying that a given communication graph satisfies the strict graph connectivity requirement makes W-MSR difficult to use in practice. In this paper, we show that a communication graph structure commonly used in the robotics literature, namely the communication graph built from a Voronoi tessellation, automatically yields a graph with sufficient connectivity to reject a single non-cooperative team member. Furthermore, we show how this graph can be augmented to reject two non-cooperative team members, and we provide a roadmap for modifications toward further resilience. This contribution will allow resilient consensus to be easily applied in algorithms that already rely on Voronoi-based communication, such as distributed coverage and exploration algorithms.
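The W-MSR update rule itself is compact: at each step, a robot sorts the values received from its neighbors, discards up to F values strictly above its own (the largest ones) and up to F strictly below (the smallest ones), and averages whatever remains together with its own value. A minimal sketch with uniform averaging weights follows; the general algorithm allows arbitrary positive weights, so this is one simplification, not the paper's exact formulation.

```python
def wmsr_update(own, neighbor_values, f):
    # Values strictly above own, largest first; strictly below, smallest
    # first. An attacker must be among the extremes to pull the average,
    # so trimming the f most extreme values on each side bounds its effect.
    above = sorted((v for v in neighbor_values if v > own), reverse=True)
    below = sorted(v for v in neighbor_values if v < own)
    kept = [v for v in neighbor_values if v == own]  # equal values stay
    kept += above[min(f, len(above)):]  # drop up to f largest above
    kept += below[min(f, len(below)):]  # drop up to f smallest below
    values = kept + [own]
    return sum(values) / len(values)

# One outlier (10.0) injected by a non-cooperative neighbor, F = 1:
# the outlier and the lowest value are trimmed before averaging.
print(wmsr_update(0.5, [0.4, 0.6, 10.0], 1))
```

Note that the trimming is purely local: each robot needs only the values of its own neighbors, which is why the burden shifts to guaranteeing the graph is connected enough, the property the paper establishes for Voronoi-based graphs.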
Continually learning new classes from a few training examples, without forgetting previously learned old classes, calls for a flexible architecture with an inevitably growing portion of storage, in which new examples and classes can be incrementally stored and efficiently retrieved. One viable architectural solution is to tightly couple a stationary deep neural network to a dynamically evolving explicit memory (EM). As the centerpiece of this architecture, we propose an EM unit that leverages energy-efficient in-memory computing (IMC) cores during the course of continual learning operations. We demonstrate for the first time how the EM unit can physically superpose multiple training examples, expand to accommodate unseen classes, and perform similarity search during inference, using operations on an IMC core based on phase-change memory (PCM). Specifically, the physical superposition of a few encoded training examples is realized via in-situ progressive crystallization of PCM devices. The classification accuracy achieved on the IMC core remains within a range of 1.28%–2.5% of that of the state-of-the-art full-precision baseline software model for continual learning, when learning novel classes (with only five examples per class) on top of 60 old classes.
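In purely digital terms, the two memory operations described above, superposing a few encoded examples into a class entry and similarity search at inference, can be sketched as follows. This is only a software analogue of what the PCM hardware computes in place: the function names and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math

def superpose(encoded_examples):
    # Class prototype as the element-wise sum of encoded examples: the
    # digital analogue of accumulating examples onto the same PCM
    # devices via in-situ progressive crystallization.
    dim = len(encoded_examples[0])
    return [sum(e[i] for e in encoded_examples) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify(query, prototypes):
    # Similarity search over the explicit memory: return the label of
    # the stored prototype most similar to the query vector.
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Two classes learned from two examples each; adding a third class
# would simply add another entry, mirroring the EM's ability to expand.
protos = {
    "old_class": superpose([[1.0, 0.0], [0.9, 0.1]]),
    "new_class": superpose([[0.0, 1.0], [0.1, 0.9]]),
}
print(classify([1.0, 0.2], protos))
```

The appeal of the IMC realization is that both the accumulation and the similarity search happen inside the memory array itself, avoiding the data movement this software version implies.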